- Deep learning accelerators are important tools for meeting the growing demand for deep learning applications. The automated design of such accelerators, which is important for reducing development costs, can be viewed as a search over a vast and complex design space that consists of all possible accelerators and all the possible software that could run on them. Unfortunately, this search is complicated by the existence of many ordinal and categorical values, which are critical to explore for the ultimate design but are not handled well by existing search techniques. This paper presents a technique for efficiently searching this space by injecting domain information, in this case information about hardware/software (HW/SW) co-design, into the automated search process. Specifically, this paper introduces a novel Bayesian optimization framework called daBO (domain-aware BO) that accepts domain information as input, including information describing ordinal and categorical values. This paper also introduces Spotlight, a design tool based on daBO, and empirically shows that Spotlight produces accelerator designs and software schedules that are orders of magnitude better than those created by the state of the art. For example, for the ResNet-50 deep learning model, Spotlight produces a HW/SW configuration that reduces delay by 135x over the configuration produced by ConfuciuX, a state-of-the-art HW/SW co-design tool, and Spotlight reduces energy-delay product (EDP) by 44x over an Eyeriss-like accelerator, which is an edge-scale hand-designed accelerator. In the realm of cloud-scale accelerators, Spotlight reduces the EDP of a scaled-up Eyeriss-like accelerator by 23x. Our evaluation shows that Spotlight benefits from the efficiency of daBO, which allows Spotlight to identify accelerator designs and software schedules that prior work cannot identify.
- The past decade has seen the rise of highly successful cache replacement policies that are based on binary prediction. For example, the Hawkeye policy learns whether lines loaded by a given PC are Cache Friendly (likely to remain in the cache if Belady’s MIN policy had been used) or Cache Averse (likely to be evicted by Belady’s MIN policy). In this paper, we instead present a cache replacement policy that is based on multiclass prediction, which allows it to directly mimic Belady’s MIN policy in a surprisingly simple and effective way. Our policy uses a PC-based predictor to learn each cache line’s reuse distance; it then evicts lines based on their predicted time of reuse. We show that our use of multiclass prediction is more effective than binary prediction because it allows for a finer-grained ordering of cache lines during eviction and because it is more robust to prediction errors. Our empirical results show that our new policy, which we refer to as Mockingjay, outperforms the previous state of the art on both single-core and multi-core platforms, both with and without a prefetcher. For example, with no prefetcher, on a mix of 100 multi-core workloads from the SPEC 2006, SPEC 2017, and GAP benchmark suites, Mockingjay sees an average improvement over LRU of 15.2%, compared to 7.6% for SHiP and 12.9% for Hawkeye. On a single-core platform, Mockingjay’s improvement over LRU is 5.7%, which approaches the 6.0% improvement of Belady’s unrealizable MIN policy. On a single-core platform (with a prefetcher) running the high-MPKI CVP workloads, Mockingjay’s improvement over LRU is 20.1%, compared to 13.4% for Hawkeye.
- Temporal prefetchers are powerful because they can prefetch irregular sequences of memory accesses, but temporal prefetchers are commercially infeasible because they store large amounts of metadata in DRAM. This paper presents Triage, the first temporal data prefetcher that does not require off-chip metadata. Triage builds on two insights: (1) metadata are not equally useful, so the less useful metadata need not be saved, and (2) for irregular workloads, it is more profitable to use portions of the LLC to store metadata than data. We also introduce novel schemes to identify useful metadata, to compress metadata, and to determine the fraction of the LLC to dedicate to metadata.
- This paper presents a new Single Source Shortest Path (SSSP) algorithm for GPUs. Our key advancement is an improved work scheduler, which is central to the performance of SSSP algorithms. Previous GPU solutions for SSSP use simple work schedulers that can be implemented efficiently on GPUs but that produce low-quality schedules. Such solutions yield poor work efficiency and can underutilize the hardware due to a lack of parallelism. Our solution introduces a more sophisticated work scheduler, based on a novel highly parallel approximate priority queue, that produces high-quality schedules while being efficiently implementable on GPUs. To evaluate our solution, we use 226 graph inputs from the Lonestar 4.0 benchmark suite and the SuiteSparse Matrix Collection, and we find that our solution outperforms the previous state-of-the-art solution by an average of 2.9×, showing that an efficient work scheduling mechanism can be implemented on GPUs without sacrificing schedule quality. While this paper focuses on the SSSP problem, it has broader implications for the use of GPUs, illustrating that seemingly ill-suited data structures, such as priority queues, can be efficiently implemented for GPUs if we use the proper software structure.
- This paper presents Voyager, a novel neural network for data prefetching. Unlike previous neural models for prefetching, which are limited to learning delta correlations, our model can also learn address correlations, which are important for prefetching irregular sequences of memory accesses. The key to our solution is its hierarchical structure that separates addresses into pages and offsets and that introduces a mechanism for learning important relations among pages and offsets. Voyager provides significant prediction benefits over current data prefetchers. For a set of irregular programs from the SPEC 2006 and GAP benchmark suites, Voyager sees an average IPC improvement of 41.6% over a system with no prefetcher, compared with 21.7% and 28.2%, respectively, for idealized Domino and ISB prefetchers. We also find that for two commercial workloads for which current data prefetchers see very little benefit, Voyager dramatically improves both accuracy and coverage. At present, slow training and prediction preclude neural models from being practically used in hardware, but Voyager’s overheads are significantly lower, in every dimension, than those of previous neural models. For example, computation cost is reduced by 15-20×, and storage overhead is reduced by 110-200×. Thus, Voyager represents a significant step towards a practical neural prefetcher.
- This paper extends the reach of General Purpose GPU programming by presenting a software architecture that supports efficient fine-grained synchronization over global memory. The key idea is to transform global synchronization into global communication so that conflicts are serialized at the thread block level. With this structure, the threads within each thread block can synchronize using low-latency, high-bandwidth local scratchpad memory. To enable this architecture, we implement a scalable and efficient message passing library. Using Nvidia GTX 1080 Ti GPUs, we evaluate our new software architecture by using it to solve a set of five irregular problems on a variety of workloads. We find that on average, our solutions improve performance over carefully tuned state-of-the-art solutions by 3.6×.
- Temporal prefetchers have the potential to prefetch arbitrary memory access patterns, but they require large amounts of metadata that must typically be stored in DRAM. In 2013, the Irregular Stream Buffer (ISB) showed how this metadata could be cached on chip and managed implicitly by synchronizing its contents with that of the TLB. This paper reveals the inefficiency of that approach and presents a new metadata management scheme that uses a simple metadata prefetcher to feed the metadata cache. The result is the Managed ISB (MISB), a temporal prefetcher that significantly advances the state of the art in terms of both traffic overhead and IPC. Using a highly accurate proprietary simulator for single-core workloads, and using the ChampSim simulator for multi-core workloads, we evaluate MISB on programs from the SPEC CPU 2006 and CloudSuite benchmark suites. Our results show that for single-core workloads, MISB improves performance by 22.7%, compared to 10.6% for an idealized STMS and 4.5% for a realistic ISB. MISB also significantly reduces off-chip traffic; for SPEC, MISB's traffic overhead of 70% is roughly one fifth of STMS's (342%) and one sixth of ISB's (411%). On 4-core multi-programmed workloads, MISB improves performance by 27.5%, compared to 13.6% for idealized STMS. For CloudSuite, MISB improves performance by 12.8% (vs. 6.0% for idealized STMS), while achieving a traffic reduction of 7× (83.5% for MISB vs. 572.3% for STMS).
- State-of-the-art value predictors use either control-flow context or data context to predict values. Predictors based on control-flow context use branch histories to remember past values, but these predictors require lengthy histories to predict anything other than constant and strided values. Predictors that use data context, also known as Finite Context Method (FCM) predictors, use a history of past values to predict a broader class of values, but such predictors achieve low coverage due to long training times, and they can become complex due to speculative value histories. We observe that the combination of branch and value history provides better predictability than the use of each history separately because it can predict values in control-dependent sequences of values. Furthermore, the combination improves training time by enabling accurate predictions to be made with shorter history, and it simplifies the hardware design by removing the need for speculative value histories. Based on these observations, we propose a new unlimited-budget value predictor, the Heterogeneous-Context Value Predictor (HCVP), that, when hybridized with E-Stride, achieves a geometric mean IPC of 3.88 on the 135 public traces, as compared to 3.81 for the current leader of the Championship Value Prediction.
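
The eviction idea described in the Mockingjay abstract above (learn a reuse distance per PC, then evict the line whose predicted time of reuse is furthest away) can be illustrated with a small sketch. This is a toy model under stated assumptions, not the authors' implementation: the class name `ReuseDistanceCache`, the constant `DEFAULT_DISTANCE`, the fully associative organization, and the simple averaging update are all hypothetical choices made for illustration, and the sketch omits the sampling, hardware counters, and other details a real design would need.

```python
from dataclasses import dataclass, field

DEFAULT_DISTANCE = 8  # hypothetical initial guess for a PC with no training yet


@dataclass
class ReuseDistanceCache:
    """Toy fully associative cache that evicts by predicted time of reuse."""

    capacity: int
    clock: int = 0
    predicted: dict = field(default_factory=dict)  # pc -> predicted reuse distance
    lines: dict = field(default_factory=dict)      # addr -> (pc, last access time)

    def _predicted_reuse_time(self, addr):
        # Predicted time of next use = last access + distance learned for its PC.
        pc, last = self.lines[addr]
        return last + self.predicted.get(pc, DEFAULT_DISTANCE)

    def access(self, pc, addr):
        """Access `addr` from instruction `pc`; return True on a hit."""
        self.clock += 1
        hit = addr in self.lines
        if hit:
            # Train the predictor for the PC that last touched this line,
            # using the reuse distance that was actually observed.
            old_pc, last = self.lines[addr]
            observed = self.clock - last
            prev = self.predicted.get(old_pc, observed)
            self.predicted[old_pc] = (prev + observed) / 2
        elif len(self.lines) >= self.capacity:
            # Evict the line predicted to be reused furthest in the future,
            # which approximates the choice Belady's MIN would make.
            victim = max(self.lines, key=self._predicted_reuse_time)
            del self.lines[victim]
        self.lines[addr] = (pc, self.clock)
        return hit


cache = ReuseDistanceCache(capacity=2)
cache.access(1, "A")  # miss
cache.access(1, "A")  # hit: PC 1 learns a short reuse distance
cache.access(2, "X")  # miss: PC 2 is untrained
cache.access(2, "Y")  # miss: evicts X rather than the older line A
```

In this usage, an LRU policy would evict `A` on the last access because it is the least recently used line; the sketch instead keeps `A` because PC 1 has been observed to reuse its lines quickly, which shows how a per-line reuse prediction gives a finer-grained eviction order than recency alone.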
 An official website of the United States government